Abstract

We propose a new first-order optimisation algorithm to solve high-dimensional non-smooth composite 
minimisation problems. Typical examples of such problems have an objective that decomposes into 
a non-smooth empirical risk part and a non-smooth regularisation penalty. The proposed algorithm, 
called Semi-Proximal Mirror-Prox, leverages the Fenchel-type representation of one part of the objective 
while handling the other part of the objective via linear minimization over the domain. The algorithm 
stands in contrast with more classical proximal gradient algorithms with smoothing, which require the 
computation of proximal operators at each iteration and can therefore be impractical for high-dimensional 
problems. We establish the theoretical convergence rate of Semi-Proximal Mirror-Prox, which exhibits
the optimal complexity bound, i.e. $O(1/\epsilon^2)$, for the number of calls to the linear minimization oracle. We
present promising experimental results showing the interest of the approach in comparison to competing 
methods. 


1 Introduction 

A wide range of machine learning and signal processing problems can be formulated as the minimization of 
a composite objective: 

\[ \min_{x \in X} F(x) := f(x) + \|Bx\| \tag{1} \]

where $X$ is closed and convex, and $f$ is convex and can be either smooth, or nonsmooth yet enjoying a particular
structure. The term $\|Bx\|$ defines a regularization penalty through a norm $\|\cdot\|$, and $x \mapsto Bx$ is a linear
mapping on a closed convex set $X$. The function $f$ usually corresponds to an empirical risk, that is, an
empirical average of a possibly non-smooth loss function evaluated on a set of data-points, while $x$ encodes
the learning parameters. All in all, the objective $F$ has a doubly non-smooth structure.

In many situations, the objective function F of interest enjoys a favorable structure, namely a so-called 
Fenchel-type representation [Zmilll]: 


\[ f(x) = \max_{z \in Z} \{ \langle x, Az \rangle - \psi(z) \} \tag{2} \]

where $Z$ is a convex compact subset of a Euclidean space, and $\psi(\cdot)$ is a convex function. Several
examples of such situations are given later in the paper. Fenchel-type representations can then be leveraged to use first-order
optimisation algorithms.

*The authors would like to thank Anatoli Juditsky and Arkadi Nemirovski for fruitful discussions. This work was supported 
by the NSF Grant CMMI-1232623, the LabEx Persyval-Lab (ANR-11-LABX-0025), the project Titan (CNRS-Mastodons),
the project Macaron (ANR-14-CE23-0003-01), the MSR-Inria joint centre, and the Moore-Sloan Data Science Environment at 
NYU. 
A simple first option to minimise $F$ is to use the so-called Nesterov smoothing technique along with a
proximal gradient algorithm [28], assuming that the proximal operator associated with $X$ is computationally
tractable and cheap to compute. However, this is certainly not the case when considering problems with
norms acting in the spectral domain of high-dimensional matrices, such as the matrix nuclear-norm [13]
and structured extensions thereof. In the latter situation, another option is to use a smoothing
technique, now with a conditional gradient or Frank-Wolfe algorithm, to minimize $F$, assuming that a linear
minimization oracle associated with $X$ is cheaper to compute than the proximal operator. Neither
option takes advantage of the composite structure of the objective (1) or handles the case when the linear
mapping $B$ is nontrivial.

Contributions Our goal in this paper is to propose a new first-order optimization algorithm, called
Semi-Proximal Mirror-Prox, designed to solve the difficult non-smooth composite optimisation problem (1) without
requiring the exact computation of proximal operators. Instead, Semi-Proximal Mirror-Prox relies
upon i) Fenchel-type representability of $f$; ii) a linear minimization oracle associated with $\|\cdot\|$ in the domain $X$.
While the Fenchel-type representability of $f$ allows us to cure the non-smoothness of $f$, the linear minimisation
over the domain $X$ allows us to tackle the non-smooth regularisation penalty $\|\cdot\|$. We establish the theoretical
convergence rate of Semi-Proximal Mirror-Prox, which exhibits the optimal complexity bound, i.e. $O(1/\epsilon^2)$,
for the number of calls to the linear minimization oracle. Furthermore, Semi-Proximal Mirror-Prox generalizes
previously proposed approaches and improves upon them in special cases:

1. Case $B = 0$: Semi-Proximal Mirror-Prox does not require assumptions on favorable geometry of dual
domains $Z$ or simplicity of $\psi(\cdot)$ in (2).

2. Case $B = I$: Semi-Proximal Mirror-Prox is competitive with previously proposed approaches [16, 24]
based on smoothing techniques.

3. Case of non-trivial $B$: Semi-Proximal Mirror-Prox is the first proximal-free or conditional-gradient-type
optimization algorithm for (1).

Related work The Semi-Proximal Mirror-Prox algorithm belongs to the family of conditional gradient
algorithms, whose most basic instance is the Frank-Wolfe algorithm for constrained smooth optimization using
a linear minimization oracle. Recently, constrained non-smooth
optimisation was considered for the case when the domain $Z$ has a "favorable geometry", i.e. the domain is amenable to linear
minimisation, and a complexity bound with $O(1/\epsilon^2)$ calls to the linear minimization
oracle was established. Also recently, a method called conditional gradient sliding was proposed to solve similar problems,
using a smoothing technique, with a complexity bound in $O(1/\epsilon^2)$ for the calls to the linear minimization
oracle (LMO) and additionally an $O(1/\epsilon)$ bound for the linear operator evaluations. In fact, this $O(1/\epsilon^2)$ bound
for the LMO complexity can be shown to be optimal for conditional-gradient-type or LMO-based
algorithms, when solving general¹ non-smooth convex problems [15].

However, these previous approaches are appropriate for objectives with a non-composite structure. When
applied to our problem (1), the smoothing would be applied to the objective taken as a whole, ignoring
its composite structure. Conditional-gradient-type algorithms were recently proposed for composite
objectives, but cannot be applied to our problem: they assume either that $f$ is smooth and $B$ is the identity matrix,
or, as in [24], that $f$ is non-smooth and $B$ is also the identity matrix. The proposed Semi-Proximal
Mirror-Prox can be seen as a blend of the successful components of, respectively, the Composite Conditional Gradient
algorithm and the Composite Mirror-Prox, one that enjoys the optimal complexity bound $O(1/\epsilon^2)$ on
the total number of LMO calls, yet solves a broader class of convex problems than previously considered.

¹ Related research extended such approaches to stochastic or online settings [11, 9, 16]; such settings are beyond the scope
of this work.
Outline The paper is organized as follows. In Section 2, we describe the norm-regularized nonsmooth
problem of interest and illustrate it with several examples. In Section 3, we present the conditional-gradient-type
method based on an inexact Mirror-Prox framework for structured variational inequalities. In Section 4,
we present promising experimental results showing the interest of the approach in comparison to competing
methods, on collaborative filtering for movie recommendation and on link prediction for social network
analysis, respectively.


2 Framework and assumptions 

We present here our theoretical framework, which hinges upon a smooth convex-concave saddle point
reformulation of the norm-regularized non-smooth minimization (3). We shall use the following notations
throughout the paper. For a given norm $\|\cdot\|$, we define the dual norm as $\|s\|_* = \max_{\|x\| \le 1} \langle s, x \rangle$. For any
$x \in \mathbb{R}^{m \times n}$, $\|x\|_2 = \|x\|_F = (\sum_{i=1}^{m} \sum_{j=1}^{n} |x_{ij}|^2)^{1/2}$.

Problem We consider the composite minimization problem 

\[ \mathrm{Opt} = \min_{x \in X} f(x) + \|Bx\| \tag{3} \]

where $X$ is a closed convex set in the Euclidean space $E_x$; $x \mapsto Bx$ is a linear mapping from $X$ to $Y$ ($\supset BX$),
where $Y$ is a closed convex set in the Euclidean space $E_y$. We make two important assumptions on the
function $f$ and the norm $\|\cdot\|$ defining the regularization penalty, explained below.

Fenchel-type Representation The non-smoothness of $f$ can be challenging to tackle. However, in many
cases of interest, the function $f$ enjoys a favorable structure that allows us to tackle it with smoothing techniques.
We assume that $f(x)$ is a non-smooth convex function given by

\[ f(x) = \max_{z \in Z} \Phi(x, z) \tag{4} \]

where $\Phi(x, z)$ is a smooth convex-concave function and $Z$ is a convex and compact set in the Euclidean
space $E_z$. Such a representation was introduced and developed in [7, 12, 14] for the purpose of non-smooth
optimisation. Fenchel-type representability can be interpreted as a general form of the smoothing-favorable
structure of non-smooth functions used in the Nesterov smoothing technique. Representations of this
type are readily available for a wide family of "well-structured" nonsmooth functions $f$; examples are
given later in the paper.

Composite Linear Minimization Oracle Proximal-gradient-type algorithms require the computation
of a proximal operator at each iteration, i.e.

\[ \min_{y \in Y} \left\{ \tfrac{1}{2} \|y\|_2^2 + \langle \eta, y \rangle + \alpha \|y\| \right\}. \tag{5} \]

For several cases of interest, described below, the computation of the proximal operator can be expensive or
intractable. A classical example is the nuclear norm, whose proximal operator boils down to singular value
thresholding, therefore requiring a full singular value decomposition. In contrast to the proximal operator,
the linear minimization oracle can be much cheaper. The linear minimization oracle (LMO) is a routine which,
given an input $\alpha > 0$ and $\eta \in E_y$, returns a point

\[ \min_{y \in Y} \left\{ \langle \eta, y \rangle + \alpha \|y\| \right\}. \tag{6} \]

In the case of the nuclear norm, the LMO only requires the computation of the top pair of
eigenvectors/eigenvalues, which is an order of magnitude faster in time-complexity.
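To make the contrast between (5) and (6) concrete, here is a minimal NumPy sketch for the nuclear norm. It rests on assumptions that are not in the text: $Y$ is taken to be a nuclear-norm ball of radius `radius` for the LMO, the prox is written for an unconstrained $Y$ (so that (5) corresponds to calling `prox_nuclear(-eta, alpha)`), and the leading singular pair is obtained by plain power iteration. The function names are illustrative only.

```python
import numpy as np

def prox_nuclear(point, alpha):
    """Prox of alpha * nuclear norm at `point`: soft-threshold the singular values.

    This needs a full SVD of the input matrix.
    """
    U, s, Vt = np.linalg.svd(point, full_matrices=False)
    return (U * np.maximum(s - alpha, 0.0)) @ Vt

def top_singular_pair(eta, iters=500, seed=0):
    """Leading singular triple of eta by power iteration on eta^T eta."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(eta.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = eta.T @ (eta @ v)
        norm_w = np.linalg.norm(w)
        if norm_w == 0.0:          # eta has no component along v
            break
        v = w / norm_w
    u = eta @ v
    sigma = np.linalg.norm(u)
    if sigma > 0.0:
        u = u / sigma
    return u, sigma, v

def composite_lmo_nuclear(eta, alpha, radius):
    """argmin over {||y||_* <= radius} of <eta, y> + alpha * ||y||_*.

    For a fixed nuclear norm s the best value is s * (alpha - sigma_max(eta)),
    so the minimiser is either the origin or a rank-one matrix on the boundary.
    """
    u, sigma, v = top_singular_pair(eta)
    if sigma <= alpha:
        return np.zeros_like(eta)
    return -radius * np.outer(u, v)
```

The prox touches every singular value, whereas the oracle only needs the leading singular pair; in practice one would likely replace the power iteration with a Lanczos-type solver.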
Saddle Point Reformulation. The crux of our approach is a smooth convex-concave saddle point
reformulation of (3). After massaging the saddle-point reformulation, we consider the variational inequality
associated with the obtained saddle-point problem. For a constrained smooth optimisation problem, the
corresponding variational inequality provides the necessary and sufficient condition for an optimal solution
to the problem. For non-smooth optimization problems, the corresponding variational inequality is
directly related to the accuracy certificate used to guarantee the accuracy of a solution to the optimisation
problem; see Appendix A. We shall then present an algorithm to solve the variational inequality
established below, which leverages its particular structure.

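Assuming that $f$ admits a Fenchel-type representation (4), we rewrite (3) in epigraph form

\[ \min_{x \in X,\, y \in Y,\, \tau \ge \|y\|} \; \max_{z \in Z} \left\{ \Phi(x, z) + \tau : y = Bx \right\}, \]

which, with a properly selected $\rho > 0$, can be further approximated by

\[ \widehat{\mathrm{Opt}} = \min_{x \in X,\, y \in Y,\, \tau \ge \|y\|} \; \max_{z \in Z} \left\{ \Phi(x, z) + \tau + \rho \|y - Bx\|_2 \right\} \tag{7} \]

\[ \phantom{\widehat{\mathrm{Opt}}} = \min_{x \in X,\, y \in Y,\, \tau \ge \|y\|} \; \max_{z \in Z,\, \|w\|_2 \le 1} \left\{ \Phi(x, z) + \tau + \rho \langle y - Bx, w \rangle \right\}. \tag{8} \]

In fact, when $\rho$ is large enough one can always guarantee $\widehat{\mathrm{Opt}} = \mathrm{Opt}$. It is indeed sufficient to set $\rho$ as the
Lipschitz constant of $\|\cdot\|$ with respect to $\|\cdot\|_2$.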

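Introduce the variables $u := [x, y; z, w]$ and $v := \tau$. The variational inequality associated with the above
saddle point problem is fully described by the domain

\[ X_+ = \{ x_+ = [u; v] : x \in X,\ y \in Y,\ z \in Z,\ \|w\|_2 \le 1,\ \tau \ge \|y\| \} \]

and the monotone vector field

\[ F(x_+ = [u; v]) = [F_u(u); F_v], \]

where

\[ F_u\!\left( u = \begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix} \right) = \begin{bmatrix} \nabla_x \Phi(x, z) - \rho B^T w \\ \rho w \\ -\nabla_z \Phi(x, z) \\ \rho (Bx - y) \end{bmatrix}, \qquad F_v(v = \tau) = 1. \]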


In the next section, we present an efficient algorithm to solve this type of variational inequality, which enjoys 
a particular structure; we call such an inequality semi-structured. 
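As an illustration of the vector field above, the following sketch builds $F_u$ for one concrete, assumed choice of $\Phi$, namely the bilinear function $\Phi(x, z) = \langle Ax - b, z \rangle$ (this choice, and the names `make_field` and `monotonicity_gap`, are not from the paper). For such a bilinear $\Phi$ the field is skew-symmetric up to a constant, so the monotonicity inner product should vanish up to rounding.

```python
import numpy as np

def make_field(A, b, B, rho):
    """F_u for the assumed bilinear choice Phi(x, z) = <A x - b, z>."""
    def F_u(x, y, z, w):
        g_x = A.T @ z - rho * (B.T @ w)   # grad_x Phi - rho * B^T w
        g_y = rho * w                     # derivative of rho * <y - Bx, w> in y
        g_z = -(A @ x - b)                # minus grad_z Phi
        g_w = rho * (B @ x - y)           # minus derivative in w
        return np.concatenate([g_x, g_y, g_z, g_w])
    return F_u

def monotonicity_gap(F_u, p, q):
    """<F_u(p) - F_u(q), p - q>, which is nonnegative for a monotone field."""
    diff = np.concatenate(p) - np.concatenate(q)
    return float((F_u(*p) - F_u(*q)) @ diff)
```

The $v$-component is omitted because $F_v$ is the constant 1 and therefore does not affect monotonicity.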


3 Semi-Proximal Mirror-Prox for Semi-structured Variational Inequalities

Semi-structured variational inequalities (Semi-VI) enjoy a particular product structure that allows us to get the
best of two worlds, namely the proximal setup (where the proximal operator can be computed) and the LMO
setup (where the linear minimization oracle can be computed). Basically, the domain $X$ is decomposed as a
Cartesian product over two sets $X = X_1 \times X_2$, such that $X_1$ admits a proximal-mapping while $X_2$ admits a
linear minimization oracle. We now describe the main theoretical and algorithmic components of the
Semi-Proximal Mirror-Prox algorithm, in Sec. 3.1 and in Sec. 3.2 respectively, and finally describe the overall algorithm
in Sec. 3.3.


3.1 Composite Mirror-Prox with Inexact Prox-mappings 

We first present a new algorithm, which can be seen as an extension of the composite Mirror Prox algorithm,
denoted CMP for brevity, that allows inexact computation of the Prox-mappings and can solve a broad class
of variational inequalities. The original Mirror Prox algorithm was introduced in [18], and was extended to
composite minimization in [12] assuming exact computations of Prox-mappings.
Structured Variational Inequalities. We consider the variational inequality VI($X$, $F$):

\[ \text{Find } x_* \in X : \langle F(x), x - x_* \rangle \ge 0, \ \forall x \in X \]

with domain $X$ and operator $F$ that satisfy the assumptions (A.1)-(A.4) below.

(A.1) Set $X \subset E_u \times E_v$ is closed convex and its projection $PX = \{u : x = [u; v] \in X\} \subset U$, where $U$ is
convex and closed, and $E_u$, $E_v$ are Euclidean spaces;

(A.2) The function $\omega(\cdot) : U \to \mathbb{R}$ is continuously differentiable and also 1-strongly convex w.r.t. some norm²
$\|\cdot\|$. This defines the Bregman distance

\[ V_u(u') = \omega(u') - \omega(u) - \langle \omega'(u), u' - u \rangle \ge \tfrac{1}{2} \|u' - u\|^2. \]

(A.3) The operator $F(x = [u, v]) : X \to E_u \times E_v$ is monotone and of the form $F(u, v) = [F_u(u); F_v]$ with
$F_v \in E_v$ being a constant and $F_u(u) \in E_u$ satisfying the condition

\[ \forall u, u' \in U : \|F_u(u) - F_u(u')\|_* \le L \|u - u'\| + M \]

for some $L < \infty$, $M < \infty$;

(A.4) The linear form $\langle F_v, v \rangle$ of $[u; v] \in E_u \times E_v$ is bounded from below on $X$ and is coercive on $X$ w.r.t.
$v$: whenever $[u^t; v^t] \in X$, $t = 1, 2, \ldots$ is a sequence such that $\{u^t\}_{t=1}^{\infty}$ is bounded and $\|v^t\|_2 \to \infty$ as
$t \to \infty$, we have $\langle F_v, v^t \rangle \to \infty$, $t \to \infty$.

$\epsilon$-Prox-mapping In the Composite Mirror Prox with exact Prox-mappings, the quality of an iterate,
in the course of the algorithm, is measured through the so-called dual gap function

\[ \epsilon_{VI}(x \mid X, F) = \sup_{y \in X} \langle F(y), x - y \rangle. \]

We give in Appendix A a refresher on dual gap functions, for the reader's convenience. We shall establish the
complexity bounds in terms of this dual gap function for our algorithm, which directly provides an accuracy
certificate along the iterations. However, we first need to define what we mean by an inexact prox-mapping.
Inexact proximal mappings were recently considered in the context of accelerated proximal gradient
algorithms [25]. The definition we give below is more general, allowing for non-Euclidean proximal-mappings.

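We introduce here the notion of $\epsilon$-prox-mapping ($\epsilon \ge 0$). For $\xi = [\eta; \zeta] \in E_u \times E_v$ and $x = [u; v] \in X$, let
us define the subset $P_x^{\epsilon}(\xi)$ of $X$ as

\[ P_x^{\epsilon}(\xi) = \{ \hat{x} = [\hat{u}; \hat{v}] \in X : \langle \eta + \omega'(\hat{u}) - \omega'(u), \hat{u} - s \rangle + \langle \zeta, \hat{v} - w \rangle \le \epsilon \ \ \forall [s; w] \in X \}. \]

When $\epsilon = 0$, this reduces to the exact prox-mapping, in the usual setting, that is

\[ P_x(\xi) = \operatorname{Argmin}_{[s; w] \in X} \{ \langle \eta, s \rangle + \langle \zeta, w \rangle + V_u(s) \}. \]

When $\epsilon > 0$, this yields our definition of an inexact prox-mapping, with inexactness parameter $\epsilon$. Note that
for any $\epsilon \ge 0$, the set $P_x^{\epsilon}(\xi = [\eta; \gamma F_v])$ is well defined whenever $\gamma > 0$. The Composite Mirror-Prox with
Inexact Prox-mappings is outlined in Algorithm 1.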
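The defining inequality of an $\epsilon$-prox-mapping can be checked numerically whenever a linear minimization oracle over the domain is available, since the supremum over $[s; w]$ is a linear problem. The sketch below does this in a deliberately simplified setting that is assumed here and not taken from the paper: $X = [0, 1]^n$, Euclidean $\omega(u) = \tfrac{1}{2}\|u\|_2^2$, and no $v$-component. The function names are hypothetical.

```python
import numpy as np

def prox_inexactness(eta, u, u_hat):
    """Smallest eps for which u_hat is an eps-prox-mapping of eta at u.

    Assumes X = [0, 1]^n, omega(u) = 0.5 * ||u||_2^2 and no v-component.
    """
    g = eta + u_hat - u                 # eta + omega'(u_hat) - omega'(u)
    # sup over s in the box of <g, u_hat - s>; the inner minimum is a box LMO
    return float(g @ u_hat - np.minimum(g, 0.0).sum())

def exact_prox(eta, u):
    """Exact Euclidean prox-mapping on the box: projection of u - eta."""
    return np.clip(u - eta, 0.0, 1.0)
```

The exact prox-mapping has inexactness zero, while any other feasible point has a strictly positive value; this is the same quantity that a conditional gradient routine monitors as its stopping certificate.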

² There is a slight abuse of notation here. The norm here is not the same as the one in problem (3).
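Algorithm 1 Composite Mirror Prox Algorithm (CMP) for VI($X$, $F$)

Input: stepsizes $\gamma_t > 0$, inexactness $\epsilon_t \ge 0$, $t = 1, 2, \ldots$

Initialize $x^1 = [u^1; v^1] \in X$

for $t = 1, 2, \ldots, T$ do

    $y^t := [\hat{u}^t; \hat{v}^t] \in P_{x^t}^{\epsilon_t}(\gamma_t F(x^t)) = P_{x^t}^{\epsilon_t}(\gamma_t [F_u(u^t); F_v])$
    $x^{t+1} := [u^{t+1}; v^{t+1}] \in P_{x^t}^{\epsilon_t}(\gamma_t F(y^t)) = P_{x^t}^{\epsilon_t}(\gamma_t [F_u(\hat{u}^t); F_v])$        (9)

end for

Output: $x_T := [\bar{u}_T; \bar{v}_T] = (\sum_{t=1}^{T} \gamma_t)^{-1} \sum_{t=1}^{T} \gamma_t y^t$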
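The two prox steps per iteration and the weighted averaging of the intermediate points can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm in its full generality: the prox-mappings are exact ($\epsilon_t = 0$) and Euclidean, there is no $v$-component, the step size is constant, and the toy problem $\min_{x \in [-1,1]} \max_{z \in [-1,1]} xz$ is chosen here for convenience.

```python
import numpy as np

def mirror_prox(F, project, u1, gamma, T):
    """Euclidean Mirror-Prox with exact prox-mappings and no v-component."""
    u = np.asarray(u1, dtype=float)
    weighted_sum = np.zeros_like(u)
    for _ in range(T):
        u_hat = project(u - gamma * F(u))       # y^t: prox step at x^t
        u = project(u - gamma * F(u_hat))       # x^{t+1}: same center, field at y^t
        weighted_sum += gamma * u_hat
    return weighted_sum / (gamma * T)           # weighted average of the y^t

# Toy saddle point: min over x in [-1, 1], max over z in [-1, 1] of x * z.
def F_bilinear(u):
    x, z = u
    return np.array([z, -x])                    # [d/dx, -d/dz] of x * z

def project_box(u):
    return np.clip(u, -1.0, 1.0)
```

For this toy problem the dual gap of a point $(x, z)$ equals $|x| + |z|$, so the quality of the averaged output can be read off directly.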


Note that this composite version of the Mirror Prox algorithm works essentially as if there were no
$v$-component at all. The proposed algorithm is a non-trivial extension of the Composite Mirror-Prox
with exact prox-mappings, both from a theoretical and an algorithmic point of view. We establish below the
theoretical convergence rate; see the Appendix for the proof.

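Theorem 3.1. Assume that the sequence of step-sizes $(\gamma_t)$ in the CMP algorithm satisfies

\[ \sigma_t := \gamma_t \langle F_u(\hat{u}^t) - F_u(u^t), \hat{u}^t - u^{t+1} \rangle - V_{\hat{u}^t}(u^{t+1}) - V_{u^t}(\hat{u}^t) \le \gamma_t^2 M^2, \quad t = 1, 2, \ldots, T. \tag{10} \]

Then, denoting $\Theta[X] = \sup_{[u; v] \in X} V_{u^1}(u)$, for a sequence of inexact prox-mappings with inexactness $\epsilon_t \ge 0$,
we have

\[ \epsilon_{VI}(x_T \mid X, F) := \sup_{x \in X} \langle F(x), x_T - x \rangle \le \frac{\Theta[X] + M^2 \sum_{t=1}^{T} \gamma_t^2 + 2 \sum_{t=1}^{T} \epsilon_t}{\sum_{t=1}^{T} \gamma_t}. \tag{11} \]

Remarks Note that the assumption on the sequence of step-sizes $(\gamma_t)$ is clearly satisfied for sufficiently small step-sizes.
When $M = 0$, it is satisfied as long as $\gamma_t \le L^{-1}$.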

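Corollary 3.1. Assume further that $X = X_1 \times X_2$, and let $F$ be the monotone vector field associated with
the saddle point problem

\[ \mathrm{SadVal} = \min_{x^1 \in X_1} \max_{x^2 \in X_2} \Phi(x^1, x^2), \tag{12} \]

two induced convex optimization problems

\[ \begin{aligned} \mathrm{Opt}(P) &= \min_{x^1 \in X_1} \left[ \overline{\Phi}(x^1) = \sup_{x^2 \in X_2} \Phi(x^1, x^2) \right] & (P) \\ \mathrm{Opt}(D) &= \max_{x^2 \in X_2} \left[ \underline{\Phi}(x^2) = \inf_{x^1 \in X_1} \Phi(x^1, x^2) \right] & (D) \end{aligned} \tag{13} \]

with convex-concave locally Lipschitz continuous cost function $\Phi$. In addition, assuming that problem (P) in
(13) is solvable with optimal solution $x_*^1$ and denoting by $x_T^1$ the projection of $x_T \in X = X_1 \times X_2$ onto $X_1$,
we have

\[ \overline{\Phi}(x_T^1) - \mathrm{Opt}(P) \le \left[ \sum_{t=1}^{T} \gamma_t \right]^{-1} \left[ \Theta[\{x_*^1\} \times X_2] + M^2 \sum_{t=1}^{T} \gamma_t^2 + 2 \sum_{t=1}^{T} \epsilon_t \right]. \tag{14} \]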


The theoretical convergence rate established in Theorem 3.1 and Corollary 3.1 generalizes the previous
result established in Corollary 3.1 in [12] for CMP with exact prox-mappings. Indeed, when exact
prox-mappings are used, we recover the result of [12]. When inexact prox-mappings are used, the errors due to
the inexactness of the prox-mappings accumulate and are reflected in the bounds (34) and (14).


3.2 Composite Conditional Gradient 


We now turn to a variant of the composite conditional gradient algorithm, denoted CCG, tailored for a
particular class of problems, which we call smooth semi-linear problems. We present here an extension of the
original composite conditional gradient algorithm, which will turn out to be especially tailored for the
sub-problems that will be solved in Sec. 3.3.
Minimizing Smooth Semi-linear Problems. We consider the smooth semi-linear problem

\[ \min_{x = [u; v] \in X} \left\{ \phi^+(u, v) = \phi(u) + \langle \theta, v \rangle \right\} \tag{15} \]

represented by the pair $(X; \phi^+)$ such that the following assumptions are satisfied. We assume that

i) $X \subset E_u \times E_v$ is closed convex and its projection $PX \subset U$, where $U$ is convex and compact;

ii) $\phi(u) : U \to \mathbb{R}$ is a convex continuously differentiable function, and there exist $1 < \kappa \le 2$ and $L_0 < \infty$
such that

\[ \phi(u') \le \phi(u) + \langle \nabla \phi(u), u' - u \rangle + \frac{L_0}{\kappa} \|u' - u\|^{\kappa} \quad \forall u, u' \in U; \tag{16} \]

iii) $\theta \in E_v$ is such that every linear function on $E_u \times E_v$ of the form

\[ [u; v] \mapsto \langle \eta, u \rangle + \langle \theta, v \rangle \tag{17} \]

with $\eta \in E_u$ attains its minimum on $X$ at some point $x[\eta] = [u[\eta]; v[\eta]]$; we have at our disposal a
Composite Linear Minimization Oracle (LMO) which, given on input $\eta \in E_u$, returns $x[\eta]$.


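Algorithm 2 Composite Conditional Gradient Algorithm CCG($X$, $\phi(\cdot)$, $\theta$; $\epsilon$)

Input: accuracy $\epsilon > 0$ and $\gamma_t = 2/(t + 1)$, $t = 1, 2, \ldots$

Initialize $x^1 = [u^1; v^1] \in X$

for $t = 1, 2, \ldots$ do

    Compute $\delta_t = \langle g_t, u^t - u[g_t] \rangle + \langle \theta, v^t - v[g_t] \rangle$, where $g_t = \nabla \phi(u^t)$;
    if $\delta_t \le \epsilon$ then

        Return $x^t = [u^t; v^t]$
    else

        Update $x^{t+1} = [u^{t+1}; v^{t+1}] \in X$ such that $\phi^+(x^{t+1}) \le \phi^+(x^t + \gamma_t (x[g_t] - x^t))$

    end if
end for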


The algorithm is outlined in Algorithm 2. Note that CCG works essentially as if there were no
$v$-component at all. The CCG algorithm enjoys a convergence rate in $O(t^{-(\kappa - 1)})$ in the evaluations of the
function $\phi^+$, and the accuracy certificates $(\delta_t)$ enjoy the same rate $O(t^{-(\kappa - 1)})$ as well, for solving problems
of type (15). See the Appendix for details and the proof.
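A minimal sketch of the CCG loop is given below, on an assumed toy instance that is not taken from the paper: $X = \{[u; v] : \|u\|_1 \le v \le R\}$, $\phi(u) = \tfrac{1}{2}\|u - c\|_2^2$ and a scalar $\theta = \alpha$, so that the semi-linear problem is an epigraph form of $\ell_1$-regularized least squares. The update simply takes the convex combination $x^t + \gamma_t (x[g_t] - x^t)$, which trivially satisfies the descent requirement of Algorithm 2; the function names and the iteration cap are illustrative.

```python
import numpy as np

def lmo_l1_epigraph(eta, theta, radius):
    """argmin of <eta, u> + theta * v over {(u, v): ||u||_1 <= v <= radius}."""
    i = int(np.argmax(np.abs(eta)))
    u = np.zeros_like(eta)
    if abs(eta[i]) <= theta:            # the penalty dominates: stay at the origin
        return u, 0.0
    u[i] = -radius * np.sign(eta[i])
    return u, radius

def ccg(grad_phi, theta, lmo, u, v, eps, max_iter=1_000_000):
    """Composite conditional gradient with step 2/(t+1) and certificate delta_t."""
    delta = np.inf
    for t in range(1, max_iter + 1):
        g = grad_phi(u)
        u_lmo, v_lmo = lmo(g)
        delta = float(g @ (u - u_lmo) + theta * (v - v_lmo))
        if delta <= eps:                # certificate bounds the suboptimality
            break
        step = 2.0 / (t + 1)
        u = u + step * (u_lmo - u)
        v = v + step * (v_lmo - v)
    return u, v, delta

# Toy instance: min 0.5 * ||u - c||^2 + alpha * v  s.t.  ||u||_1 <= v <= R.
c, alpha, R = np.array([1.5, -0.2, 0.7]), 0.5, 2.0
u_out, v_out, delta_out = ccg(lambda u: u - c, alpha,
                              lambda g: lmo_l1_epigraph(g, alpha, R),
                              np.zeros(3), 0.0, 1e-3)
```

When $R$ is large enough for the constraint $v \le R$ to be inactive, the $u$-part of the solution should approach the soft-thresholding of $c$ at level $\alpha$, which gives a simple sanity check.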


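Proposition 3.1. Denote by $D$ the $\|\cdot\|$-diameter of $U$. When solving problems of type (15), the sequence of
iterates $(x^t)$ of CCG satisfies

\[ \epsilon_t := \phi^+(x^t) - \min_{x \in X} \phi^+(x) \le \frac{2 L_0 D^{\kappa}}{\kappa (3 - \kappa)} \left( \frac{2}{t + 1} \right)^{\kappa - 1}, \quad t \ge 2. \tag{18} \]

In addition, the accuracy certificates $(\delta_t)$ satisfy

\[ \min_{1 \le s \le t} \delta_s \le O(1) L_0 D^{\kappa} \left( \frac{2}{t + 1} \right)^{\kappa - 1}, \quad t \ge 2. \tag{19} \]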


3.3 Semi-Proximal Mirror-Prox for Semi-structured Variational Inequality 

We now give the full description of a special class of variational inequalities, called semi-structured variational
inequalities. This family of problems encompasses both cases that we discussed so far in Sections 3.1 and
3.2. But most importantly, it also covers many other problems that do not fall into these two regimes and,
in particular, our essential problem of interest (3).
Semi-structured Variational Inequalities. The class of semi-structured variational inequalities allows us
to go beyond Assumptions (A.1)-(A.4), by assuming more structure. This structure is consistent with
what we call a semi-proximal setup, which encompasses both the regular proximal setup and the regular
linear minimization setup as special cases. Indeed, we consider a class of variational inequalities VI($X$, $F$)
that satisfies, in addition to Assumptions (A.1)-(A.4), the following assumptions:

(S.1) Proximal setup for $X$: we assume that $E_u = E_{u_1} \times E_{u_2}$, $E_v = E_{v_1} \times E_{v_2}$, and $U \subset U_1 \times U_2$, $X = X_1 \times X_2$
with $X_i \subset E_{u_i} \times E_{v_i}$ and $P_i X = \{u_i : [u_i; v_i] \in X_i\} \subset U_i$ for $i = 1, 2$, where $U_1$ is convex and closed, and $U_2$
is convex and compact. We also assume that $\omega(u) = \omega_1(u_1) + \omega_2(u_2)$ and $\|u\| = \|u_1\|_{E_{u_1}} + \|u_2\|_{E_{u_2}}$,
with $\omega_2(\cdot) : U_2 \to \mathbb{R}$ continuously differentiable such that

\[ \omega_2(u_2') \le \omega_2(u_2) + \langle \nabla \omega_2(u_2), u_2' - u_2 \rangle + \frac{L_0}{\kappa} \|u_2' - u_2\|_{E_{u_2}}^{\kappa}, \quad \forall u_2, u_2' \in U_2; \]

for a particular $1 < \kappa \le 2$ and $L_0 < \infty$. Furthermore, we assume that the $\|\cdot\|_{E_{u_2}}$-diameter of $U_2$ is
bounded by some $D > 0$.

(S.2) Proximal mapping on $X_1$: we assume that for any $\eta_1 \in E_{u_1}$ and $\alpha > 0$, we have at our disposal
easy-to-compute prox-mappings of the form

\[ \mathrm{Prox}_{\omega_1}(\eta_1, \alpha) := \min_{x_1 = [u_1; v_1] \in X_1} \left\{ \omega_1(u_1) + \langle \eta_1, u_1 \rangle + \alpha \langle F_{v_1}, v_1 \rangle \right\}. \]

(S.3) Linear minimization on $X_2$: we assume that we have at our disposal a Composite Linear Minimization
Oracle (LMO), which, given any input $\eta_2 \in E_{u_2}$ and $\alpha > 0$, returns an optimal solution to the
minimization problem with linear form, that is,

\[ \mathrm{LMO}(\eta_2, \alpha) := \min_{x_2 = [u_2; v_2] \in X_2} \left\{ \langle \eta_2, u_2 \rangle + \alpha \langle F_{v_2}, v_2 \rangle \right\}. \]


Semi-proximal setup We denote such problems as Semi-VI($X$, $F$). On the one hand, when $U_2$ is a
singleton, we get the full-proximal setup. On the other hand, when $U_1$ is a singleton, we get the full
linear-minimization-oracle setup (full LMO setup). In the gray zone in between, we get the semi-proximal setup.


The Semi-Proximal Mirror-Prox algorithm. We finally present here our main contribution, the
Semi-Proximal Mirror-Prox algorithm, which solves the semi-structured variational inequality under (A.1)-(A.4)
and (S.1)-(S.3). The Semi-Proximal Mirror-Prox algorithm blends both CMP and CCG. Basically, for the
sub-domain $X_2$ given by an LMO, instead of computing the prox-mapping exactly, we mimic the
prox-mapping inexactly via a conditional gradient algorithm within the composite Mirror Prox algorithm. For the
sub-domain $X_1$, we compute the prox-mapping as it is.


Course of the Semi-Proximal Mirror-Prox algorithm Basically, at step $t$, we first update $y_1^t = [\hat{u}_1^t; \hat{v}_1^t]$
by computing the exact prox-mapping and update $y_2^t = [\hat{u}_2^t; \hat{v}_2^t]$ by running the composite conditional gradient
algorithm on problem (15) specifically with

\[ X = X_2, \quad \phi(\cdot) = \omega_2(\cdot) + \langle \gamma_t F_{u_2}(u^t) - \omega_2'(u_2^t), \cdot \rangle, \quad \text{and} \quad \theta = \gamma_t F_{v_2}, \]

until $\delta(y_2^t) = \max_{y_2 \in X_2} \langle \nabla \phi^+(y_2^t), y_2^t - y_2 \rangle \le \epsilon_t$. We then update $x_1^{t+1} = [u_1^{t+1}; v_1^{t+1}]$ and
$x_2^{t+1} = [u_2^{t+1}; v_2^{t+1}]$ similarly, except this time taking the value of the operator at point $y^t$.
Combining the results in Theorem 3.1 and Proposition 3.1, we arrive at the following complexity bound.



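Algorithm 3 Semi-Proximal Mirror-Prox Algorithm for Semi-VI($X$, $F$)

Input: stepsizes $\gamma_t > 0$, accuracies $\epsilon_t \ge 0$, $t = 1, 2, \ldots$

[1] Initialize $x^1 = [x_1^1; x_2^1] \in X$, where $x_1^1 = [u_1^1; v_1^1]$; $x_2^1 = [u_2^1; v_2^1]$

for $t = 1, 2, \ldots, T$ do

    [2] Compute $y^t = [y_1^t; y_2^t]$ such that

        $y_1^t := [\hat{u}_1^t; \hat{v}_1^t] = \mathrm{Prox}_{\omega_1}(\gamma_t F_{u_1}(u^t) - \omega_1'(u_1^t), \gamma_t)$

        $y_2^t := [\hat{u}_2^t; \hat{v}_2^t] = \mathrm{CCG}(X_2, \omega_2(\cdot) + \langle \gamma_t F_{u_2}(u^t) - \omega_2'(u_2^t), \cdot \rangle, \gamma_t F_{v_2}; \epsilon_t)$

    [3] Compute $x^{t+1} = [x_1^{t+1}; x_2^{t+1}]$ such that

        $x_1^{t+1} := [u_1^{t+1}; v_1^{t+1}] = \mathrm{Prox}_{\omega_1}(\gamma_t F_{u_1}(\hat{u}^t) - \omega_1'(u_1^t), \gamma_t)$

        $x_2^{t+1} := [u_2^{t+1}; v_2^{t+1}] = \mathrm{CCG}(X_2, \omega_2(\cdot) + \langle \gamma_t F_{u_2}(\hat{u}^t) - \omega_2'(u_2^t), \cdot \rangle, \gamma_t F_{v_2}; \epsilon_t)$

end for

Output: $x_T := [\bar{u}_T; \bar{v}_T] = (\sum_{t=1}^{T} \gamma_t)^{-1} \sum_{t=1}^{T} \gamma_t y^t$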
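The interplay between the exact prox step on one block and the conditional-gradient-based inexact prox step on the other can be sketched on a small assumed example, the bilinear game $\min_{x \in [-1,1]^m} \max_{z \in \Delta_n} \langle x, Az \rangle$. Everything specific here is an illustrative assumption rather than the paper's setup: the box block plays the role of $X_1$ (prox by clipping), the simplex block plays the role of $X_2$ (vertex LMO), both use the Euclidean setup, there is no $v$-component, the step size is constant, and the inexactness decays as $\epsilon_t = \epsilon_0 / t$.

```python
import numpy as np

def cg_prox_simplex(center, start, eps, max_iter=100_000):
    """Approximate argmin over the simplex of 0.5 * ||s||^2 - <center, s>.

    Conditional gradient with step 2/(k+1), stopped once the certificate is
    <= eps (or when max_iter is hit); returns the point and the LMO call count.
    """
    s = start.copy()
    calls = 0
    for k in range(1, max_iter + 1):
        g = s - center
        j = int(np.argmin(g))               # simplex LMO: best vertex e_j
        calls += 1
        if float(g @ s - g[j]) <= eps:
            break
        step = 2.0 / (k + 1)
        s = (1.0 - step) * s
        s[j] += step
    return s, calls

def semi_mp(A, T, eps0=0.1):
    """Sketch of Semi-MP for min over the box, max over the simplex of <x, A z>."""
    m, n = A.shape
    gamma = 0.5 / np.linalg.norm(A, 2)      # constant step below 1/L, L = ||A||_2
    x, z = np.zeros(m), np.full(n, 1.0 / n)
    x_sum, z_sum, lmo_calls = np.zeros(m), np.zeros(n), 0
    for t in range(1, T + 1):
        eps_t = eps0 / t
        # y^t: exact prox on the box block, inexact CG-based prox on the simplex block
        x_hat = np.clip(x - gamma * (A @ z), -1.0, 1.0)
        z_hat, c1 = cg_prox_simplex(z + gamma * (A.T @ x), z, eps_t)
        # x^{t+1}: same prox centers, operator evaluated at y^t
        x_new = np.clip(x - gamma * (A @ z_hat), -1.0, 1.0)
        z_new, c2 = cg_prox_simplex(z + gamma * (A.T @ x_hat), z, eps_t)
        x, z = x_new, z_new
        x_sum += x_hat
        z_sum += z_hat
        lmo_calls += c1 + c2
    return x_sum / T, z_sum / T, lmo_calls

def duality_gap(A, x, z):
    """max over z' of <x, A z'> minus min over x' of <x', A z> for this game."""
    return float(np.max(A.T @ x) + np.abs(A @ z).sum())
```

With a constant step size the weighted average in the output reduces to a plain average of the points $y^t$. The returned call count makes it possible to plot the gap against the number of LMO calls, which is the quantity the complexity bound below is stated in.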


Proposition 3.2. Under the assumptions (A.1)-(A.4) and (S.1)-(S.3) with $M = 0$, for the outlined
algorithm to return an $\epsilon$-solution to the variational inequality VI($X$, $F$), the total number of Mirror Prox
steps required does not exceed

\[ \text{Total number of steps} = O(1) \left( \frac{L \, \Theta[X]}{\epsilon} \right) \]

and the total number of calls to the Linear Minimization Oracle does not exceed

\[ N = O(1) \left( \frac{L_0 L^{\kappa} D^{\kappa}}{\epsilon^{\kappa}} \right)^{\frac{1}{\kappa - 1}} \Theta[X]. \]

In particular, if we use the Euclidean proximal setup on $U_2$ with $\omega_2(\cdot) = \tfrac{1}{2} \|x_2\|^2$, which leads to $\kappa = 2$ and
$L_0 = 1$, then the number of LMO calls does not exceed $N = O(1) \left( L^2 D^2 (\Theta[X_1] + D^2) \right) / \epsilon^2$.


Discussion The proposed Semi-Proximal Mirror-Prox algorithm enjoys the optimal complexity bound,
i.e. $O(1/\epsilon^2)$, in the number of calls to the LMO; see [15] for the optimal complexity bounds for general
non-smooth optimisation with an LMO. Furthermore, Semi-Proximal Mirror-Prox generalizes previously proposed
approaches and improves upon them in special cases of problem (3); see the Appendix.


4 Experiments 

We present here illustrations of the proposed approach. We report the experimental results obtained with the
proposed Semi-Proximal Mirror-Prox, denoted Semi-MP here, and state-of-the-art competing optimization
algorithms. We consider three different models, all with a non-smooth loss function and a nuclear-norm
regularization penalty: i) matrix completion with an $\ell_2$ data-fidelity term; ii) robust collaborative filtering
for movie recommendation; iii) link prediction for social network analysis. For i) and ii), we compare to
two competing approaches: a) the smoothing conditional gradient proposed in [53] (denoted Smooth-CG); b)
the smoothing proximal gradient equipped with a semi-proximal setup (Semi-SPG). For iii), we compare to
Semi-LPADMM, using [55] and solving the proximal mapping through conditional gradient routines. Additional
experiments and implementation details are given in Appendix E.

Matrix completion on synthetic data We consider the matrix completion problem, with a nuclear-norm
regularisation penalty and an $\ell_2$ data-fidelity term. We first investigate the convergence patterns of
our Semi-MP and Semi-SPG under two different strategies for the inexactness: a) a fixed number of inner CG steps and
b) decaying $\epsilon_t = c/t$, as the theory suggests. The plots in Fig. 1 indicate that using the second strategy
with $O(1/t)$ decaying inexactness provides better and more reliable performance than using a fixed number of
inner steps. Similar trends are observed for Semi-SPG. One can see that these two algorithms based on
inexact proximal mappings are notably faster than applying conditional gradient on the smoothed problem.

Figure 1: Matrix completion on synthetic data (1024 x 1024): optimality gap vs the LMO calls.
From left to right: (a) Semi-MP; (b) Semi-SPG; (c) Smooth-CG; (d) best of three algorithms.

Figure 2: Robust collaborative filtering and link prediction: objective function vs elapsed time.
From left to right: (a) MovieLens 100K; (b) MovieLens 1M; (c) Wikivote(1024); (d) Wikivote(full)

Robust collaborative filtering We consider the collaborative filtering problem, with a nuclear-norm
regularisation penalty and an $\ell_1$-loss function. We run the above three algorithms on the small and
medium MovieLens datasets. The small-size dataset consists of 943 users and 1682 movies with about 100K
ratings, while the medium-size dataset consists of 6040 users and 3952 movies with about 1M ratings. The
regularisation parameters are set following prior work. In Fig. 2 we can see that Semi-MP clearly outperforms
Smooth-CG, while it is competitive with Semi-SPG.

Link prediction We now consider the link prediction problem, where the objective consists of a hinge-loss for
the empirical risk part and multiple regularization penalties, namely the $\ell_1$-norm and the nuclear-norm. For
this example, applying Smooth-CG or Semi-SPG would require two smooth approximations, one for the hinge
loss term and one for the $\ell_1$-norm term. Therefore, we consider an alternative approach, Semi-LPADMM,
where we apply the linearized preconditioned ADMM algorithm [22], solving the proximal mapping through
conditional gradient routines. To our knowledge, ADMM with early stopping has not been fully analysed
theoretically in the literature. However, from an intuitive point of view, as long as the accumulated error is controlled
sufficiently, such a variant of ADMM should converge.

We conduct experiments on a binary social graph data set called Wikivote, which consists of 7118 nodes
and 103,747 edges. Since the computational cost of these two algorithms mainly comes from the LMO calls,
we present below the performance in terms of the number of LMO calls. For the first set of experiments, we
select the 1024 highest-degree users from Wikivote and run the two algorithms on this small dataset with
different strategies for the inner LMO calls.

In Fig. 2 we observe that Semi-MP is less sensitive to the inner accuracies of the prox-mappings than
the ADMM variant, which sometimes stops progressing if the prox-mappings of early iterations are not
solved with sufficient accuracy. The results on the full dataset corroborate the fact that Semi-MP outperforms
the semi-proximal variant of the ADMM algorithm.
In this Appendix, we provide additional material on variational inequalities and non-smooth optimisation
algorithms, give the proofs of the main theorems, and provide additional information regarding the
competing algorithms based on smoothing techniques and the implementation details for different models.

A Preliminaries: Variational Inequalities and Accuracy Certificates

For the reader's convenience, we recall here the relationship between variational inequalities, accuracy
certificates, and execution protocols, for non-smooth optimization algorithms. The exposition below is taken
directly from the literature.

Execution protocols and accuracy certificates. Let $X$ be a nonempty closed convex set in a Euclidean
space $E$ and $F(x) : X \to E$ be a vector field.

Suppose that we process $(X, F)$ by an algorithm which generates a sequence of search points $x_t \in X$,
$t = 1, 2, \ldots$, and computes the vectors $F(x_t)$, so that after $t$ steps we have at our disposal the $t$-step execution
protocol $I_t = \{x_\tau, F(x_\tau)\}_{\tau=1}^{t}$. By definition, an accuracy certificate for this protocol is simply a collection
$\lambda^t = \{\lambda_\tau^t\}_{\tau=1}^{t}$ of nonnegative reals summing up to 1. We associate with the protocol $I_t$ and accuracy
certificate $\lambda^t$ two quantities as follows:

• Approximate solution $x^t(I_t, \lambda^t) := \sum_{\tau=1}^{t} \lambda_\tau^t x_\tau$, which is a point of $X$;

• Resolution $\mathrm{Res}(X' \mid I_t, \lambda^t)$ on a subset $X' \ne \emptyset$ of $X$ given by

\[ \mathrm{Res}(X' \mid I_t, \lambda^t) = \sup_{x \in X'} \sum_{\tau=1}^{t} \lambda_\tau^t \langle F(x_\tau), x_\tau - x \rangle. \tag{20} \]

The role of those notions for non-smooth optimization is explained below. 

Variational inequalities. Assume that $F$ is monotone, i.e.,

\[ \langle F(x) - F(y), x - y \rangle \ge 0, \quad \forall x, y \in X. \tag{21} \]

Our goal is to approximate a weak solution to the variational inequality (v.i.) VI($X$, $F$) associated with
$(X, F)$. A weak solution is defined as a point $x_* \in X$ such that

\[ \langle F(y), y - x_* \rangle \ge 0 \quad \forall y \in X. \tag{22} \]

A natural (in)accuracy measure of a candidate weak solution $x \in X$ to VI($X$, $F$) is the dual gap function

\[ \epsilon_{VI}(x \mid X, F) = \sup_{y \in X} \langle F(y), x - y \rangle. \tag{23} \]

This inaccuracy is a convex nonnegative function which vanishes exactly at the set of weak solutions to
VI($X$, $F$).

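Proposition A.1. For every $t$, every execution protocol $I_t = \{x_\tau \in X, F(x_\tau)\}_{\tau=1}^t$ and every accuracy
certificate $\lambda^t$ one has $x^t := x^t(I_t, \lambda^t) \in X$. Besides this, assuming $F$ monotone, for every closed convex set
$X' \subset X$ such that $x^t \in X'$ one has

$$\epsilon_{\mathrm{VI}}(x^t|X', F) \leq \mathrm{Res}(X'|I_t, \lambda^t). \tag{24}$$

Proof. Indeed, $x^t$ is a convex combination of the points $x_\tau \in X$ with coefficients $\lambda^t_\tau$, whence $x^t \in X$.
With $X'$ as in the premise of the Proposition, we have

$$\forall y \in X': \quad \langle F(y), x^t - y\rangle = \sum_{\tau=1}^t \lambda^t_\tau \langle F(y), x_\tau - y\rangle \leq \sum_{\tau=1}^t \lambda^t_\tau \langle F(x_\tau), x_\tau - y\rangle \leq \mathrm{Res}(X'|I_t, \lambda^t),$$

where the first $\leq$ is due to monotonicity of $F$. □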
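As a numerical sanity check of inequality (24), the sketch below builds a small monotone affine field on the box $[-1,1]^n$, draws an arbitrary execution protocol and certificate, and compares the resolution with sampled values of the dual gap integrand. The field, the box domain, and all variable names are illustrative assumptions, not objects from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 5, 8

# Hypothetical monotone affine field F(x) = A x + b:
# a skew-symmetric part plus a positive semidefinite symmetric part.
S = rng.standard_normal((n, n))
P = rng.standard_normal((n, n))
A = (S - S.T) + P @ P.T
b = rng.standard_normal(n)

def F(x):
    return A @ x + b

# Execution protocol on X = [-1, 1]^n and an accuracy certificate lam.
xs = rng.uniform(-1.0, 1.0, size=(t, n))
lam = rng.random(t)
lam /= lam.sum()

x_bar = lam @ xs            # approximate solution x^t(I_t, lambda^t)
Fs = xs @ A.T + b           # row tau holds F(x_tau)

# Resolution on X' = X: the sup of a linear function over the box is an l1 norm.
g = lam @ Fs
resolution = float(lam @ np.einsum("ij,ij->i", Fs, xs) + np.abs(g).sum())

# Sampled lower estimate of the dual gap sup_y <F(y), x_bar - y>.
ys = rng.uniform(-1.0, 1.0, size=(2000, n))
sampled_gap = max(float(F(y) @ (x_bar - y)) for y in ys)
```

Every sampled value stays below the resolution, as Proposition A.1 predicts for a monotone field.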
Convex-concave saddle point problems. Now let $X = X_1 \times X_2$, where $X_i$ is a closed convex subset
in Euclidean space $E_i$, $i = 1, 2$, and $E = E_1 \times E_2$, and let $\Phi(x^1, x^2): X_1 \times X_2 \to \mathbb{R}$ be a locally Lipschitz
continuous function which is convex in $x^1 \in X_1$ and concave in $x^2 \in X_2$. $X_1, X_2, \Phi$ give rise to the saddle
point problem

$$\mathrm{SadVal} = \min_{x^1 \in X_1} \max_{x^2 \in X_2} \Phi(x^1, x^2), \tag{25}$$

two induced convex optimization problems

$$\begin{array}{ll}
\mathrm{Opt}(P) = \min_{x^1 \in X_1} \left[\overline{\Phi}(x^1) = \sup_{x^2 \in X_2} \Phi(x^1, x^2)\right] & (P) \\
\mathrm{Opt}(D) = \max_{x^2 \in X_2} \left[\underline{\Phi}(x^2) = \inf_{x^1 \in X_1} \Phi(x^1, x^2)\right] & (D)
\end{array} \tag{26}$$

and a vector field $F(x^1, x^2) = [F_1(x^1, x^2); F_2(x^1, x^2)]$ specified (in general, non-uniquely) by the relations

$$\forall (x^1, x^2) \in X_1 \times X_2: \quad F_1(x^1, x^2) \in \partial_{x^1} \Phi(x^1, x^2), \quad F_2(x^1, x^2) \in \partial_{x^2}[-\Phi(x^1, x^2)].$$

It is well known that $F$ is monotone on $X$, and that weak solutions to VI$(X, F)$ are exactly the saddle
points of $\Phi$ on $X_1 \times X_2$. These saddle points exist if and only if $(P)$ and $(D)$ are solvable with equal optimal
values, in which case the saddle points are exactly the pairs $(x^1_*, x^2_*)$ comprised by optimal solutions to $(P)$
and $(D)$. In general, $\mathrm{Opt}(P) \geq \mathrm{Opt}(D)$, with equality definitely taking place when at least one of the sets
$X_1, X_2$ is bounded; if both are bounded, saddle points do exist. To avoid unnecessary complications, from
now on, when speaking about a convex-concave saddle point problem, we assume that the problem is proper,
meaning that $\mathrm{Opt}(P)$ and $\mathrm{Opt}(D)$ are reals; this definitely is the case when $X$ is bounded.

A natural (in)accuracy measure for a candidate $x = [x^1; x^2] \in X_1 \times X_2$ to the role of a saddle point of $\Phi$
is the quantity

$$\begin{aligned}
\epsilon_{\mathrm{Sad}}(x|X_1, X_2, \Phi) &= \overline{\Phi}(x^1) - \underline{\Phi}(x^2) \\
&= [\overline{\Phi}(x^1) - \mathrm{Opt}(P)] + [\mathrm{Opt}(D) - \underline{\Phi}(x^2)] + \underbrace{[\mathrm{Opt}(P) - \mathrm{Opt}(D)]}_{\geq 0}.
\end{aligned} \tag{27}$$

This inaccuracy is nonnegative and is the sum of the duality gap $\mathrm{Opt}(P) - \mathrm{Opt}(D)$ (always nonnegative and
vanishing when one of the sets $X_1, X_2$ is bounded) and the inaccuracies, in terms of respective objectives,
of $x^1$ as a candidate solution to $(P)$ and $x^2$ as a candidate solution to $(D)$.
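To make the saddle point inaccuracy concrete, here is a minimal sketch for the bilinear case $\Phi(x^1, x^2) = (x^1)^T A x^2$ on a product of probability simplices, where both the sup and the inf are attained at vertices. The matrix game and the function name `saddle_gap` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def saddle_gap(A, x1, x2):
    """Saddle point inaccuracy for Phi(x1, x2) = x1^T A x2 on two simplices."""
    upper = np.max(A.T @ x1)   # sup over x2 in the simplex of Phi(x1, x2)
    lower = np.min(A @ x2)     # inf over x1 in the simplex of Phi(x1, x2)
    return float(upper - lower)

# Rock-paper-scissors payoff matrix (hypothetical example data).
A = np.array([[0.0, 1.0, -1.0],
              [-1.0, 0.0, 1.0],
              [1.0, -1.0, 0.0]])
uniform = np.full(3, 1.0 / 3.0)
vertex = np.array([1.0, 0.0, 0.0])

gap_at_saddle = saddle_gap(A, uniform, uniform)   # uniform play is the saddle point
gap_at_vertex = saddle_gap(A, vertex, vertex)     # a pure strategy is far from it
```

The gap vanishes at the uniform strategies and is strictly positive at a vertex, matching the claim that the inaccuracy is nonnegative and zero exactly at saddle points.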

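The role of accuracy certificates in convex-concave saddle point problems stems from the following
observation:

Proposition A.2. Let $X_1, X_2$ be nonempty closed convex sets, $\Phi: X := X_1 \times X_2 \to \mathbb{R}$ be a locally Lipschitz
continuous convex-concave function, and $F$ be the associated monotone vector field on $X$.

Let $I_t = \{x_\tau = [x^1_\tau; x^2_\tau] \in X, F(x_\tau)\}_{\tau=1}^t$ be a $t$-step execution protocol associated with $(X, F)$ and
$\lambda^t = \{\lambda^t_\tau\}_{\tau=1}^t$ be an associated accuracy certificate. Then $x^t := x^t(I_t, \lambda^t) = [x^{1,t}; x^{2,t}] \in X$.

Assume, further, that $X_1' \subset X_1$ and $X_2' \subset X_2$ are closed convex sets such that

$$x^t \in X' := X_1' \times X_2'. \tag{28}$$

Then

$$\epsilon_{\mathrm{Sad}}(x^t|X_1', X_2', \Phi) = \sup_{x^2 \in X_2'} \Phi(x^{1,t}, x^2) - \inf_{x^1 \in X_1'} \Phi(x^1, x^{2,t}) \leq \mathrm{Res}(X'|I_t, \lambda^t). \tag{29}$$

In addition, setting $\tilde{\Phi}(x^1) = \sup_{x^2 \in X_2'} \Phi(x^1, x^2)$, for every $\bar{x}^1 \in X_1'$ we have

$$\tilde{\Phi}(x^{1,t}) - \tilde{\Phi}(\bar{x}^1) \leq \tilde{\Phi}(x^{1,t}) - \Phi(\bar{x}^1, x^{2,t}) \leq \mathrm{Res}(\{\bar{x}^1\} \times X_2'|I_t, \lambda^t). \tag{30}$$

In particular, when the problem $\mathrm{Opt} = \min_{x^1 \in X_1'} \tilde{\Phi}(x^1)$ is solvable with an optimal solution $x^1_*$, we have

$$\tilde{\Phi}(x^{1,t}) - \mathrm{Opt} \leq \mathrm{Res}(\{x^1_*\} \times X_2'|I_t, \lambda^t). \tag{31}$$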
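Proof. The inclusion $x^t \in X$ is clear. For every set $Y \subset X$ and every $[p; q] \in Y$ we have

$$\begin{aligned}
\mathrm{Res}(Y|I_t, \lambda^t) &\geq \sum_{\tau=1}^t \lambda^t_\tau \left[\langle F_1(x_\tau), x^1_\tau - p\rangle + \langle F_2(x_\tau), x^2_\tau - q\rangle\right] \\
&\geq \sum_{\tau=1}^t \lambda^t_\tau \left[[\Phi(x^1_\tau, x^2_\tau) - \Phi(p, x^2_\tau)] + [\Phi(x^1_\tau, q) - \Phi(x^1_\tau, x^2_\tau)]\right] \\
&\qquad \text{[by the origin of } F \text{ and since } \Phi \text{ is convex-concave]} \\
&= \sum_{\tau=1}^t \lambda^t_\tau \left[\Phi(x^1_\tau, q) - \Phi(p, x^2_\tau)\right] \geq \Phi(x^{1,t}, q) - \Phi(p, x^{2,t}) \\
&\qquad \text{[by the origin of } x^t \text{ and since } \Phi \text{ is convex-concave]}
\end{aligned}$$

Thus, for every $Y \subset X$ we have

$$\sup_{[p;q] \in Y} \left[\Phi(x^{1,t}, q) - \Phi(p, x^{2,t})\right] \leq \mathrm{Res}(Y|I_t, \lambda^t). \tag{32}$$

Now assume that Condition (28) is satisfied. Setting $Y = X' := X_1' \times X_2'$, and recalling what $\epsilon_{\mathrm{Sad}}$ is, (32)
yields (29). With $Y = \{\bar{x}^1\} \times X_2'$, (32) yields the second inequality in (30); the first inequality in (30) is clear
since $x^{2,t} \in X_2'$. □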


B Theoretical analysis of composite Mirror Prox with inexact proximal mappings

We restate Theorem 3.1 below and give its proof. The theoretical convergence rate established in
Theorem 3.1 and Corollary 3.1 extends the previous result established in Corollary 3.1 in [12] for CMP with
exact prox-mappings. Indeed, when exact prox-mappings are used, we recover the result of [12]. When
inexact prox-mappings are used, the errors due to the inexactness of the prox-mappings accumulate and are
reflected in the bounds (34) and (14).


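Theorem 3.1. Assume that the sequence of step-sizes $(\gamma_t)$ in the CMP algorithm satisfies

$$\sigma_t := \gamma_t \langle F_u(\hat{u}^t) - F_u(u^t), \hat{u}^t - u^{t+1}\rangle - V_{\hat{u}^t}(u^{t+1}) - V_{u^t}(\hat{u}^t) \leq \gamma_t^2 M^2, \quad t = 1, 2, \ldots, T. \tag{33}$$

Then, denoting $\Theta[X] = \sup_{[u;v] \in X} V_{u^1}(u)$, for a sequence of inexact prox-mappings with inexactness $\epsilon_t \geq 0$,
we have

$$\epsilon_{\mathrm{VI}}(\bar{x}_T|X, F) := \sup_{x \in X} \langle F(x), \bar{x}_T - x\rangle \leq \frac{\Theta[X] + M^2 \sum_{t=1}^T \gamma_t^2 + 2\sum_{t=1}^T \epsilon_t}{\sum_{t=1}^T \gamma_t}. \tag{34}$$

Remarks. Note that the assumption on the sequence of step-sizes $(\gamma_t)$ is clearly satisfied when $\gamma_t \leq (\sqrt{2} L)^{-1}$.
When $M = 0$, it is satisfied as long as $\gamma_t \leq L^{-1}$.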

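Proof. The proof builds upon and extends the proof in [12]. For all $u, u', w \in U$, we have the well-known
identity

$$\langle V_u'(u'), w - u'\rangle = V_u(w) - V_{u'}(w) - V_u(u'). \tag{35}$$

Indeed, the right hand side writes as

$$\begin{aligned}
&[\omega(w) - \omega(u) - \langle \omega'(u), w - u\rangle] - [\omega(w) - \omega(u') - \langle \omega'(u'), w - u'\rangle] - [\omega(u') - \omega(u) - \langle \omega'(u), u' - u\rangle] \\
&\quad = \langle \omega'(u), u - w\rangle + \langle \omega'(u), u' - u\rangle + \langle \omega'(u'), w - u'\rangle = \langle \omega'(u') - \omega'(u), w - u'\rangle = \langle V_u'(u'), w - u'\rangle.
\end{aligned}$$

For $x = [u; v] \in X$, $\xi = [\eta; \zeta]$, $\epsilon \geq 0$, let $[u'; v'] \in P_x^{\epsilon}(\xi)$. By definition, for all $[s; w] \in X$, the inequality holds

$$\langle \eta + V_u'(u'), u' - s\rangle + \langle \zeta, v' - w\rangle \leq \epsilon,$$

which by (35) implies that

$$\langle \eta, u' - s\rangle + \langle \zeta, v' - w\rangle \leq \langle V_u'(u'), s - u'\rangle + \epsilon = V_u(s) - V_{u'}(s) - V_u(u') + \epsilon. \tag{36}$$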
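When applying (36) with $\epsilon = \epsilon_t$, $[u; v] = [u^t; v^t] = x^t$, $\xi = \gamma_t F(x^t) = [\gamma_t F_u(u^t); \gamma_t F_v]$, $[u'; v'] = [\hat{u}^t; \hat{v}^t] = y^t$,
and $[s; w] = [u^{t+1}; v^{t+1}] = x^{t+1}$, we obtain

$$\gamma_t\left[\langle F_u(u^t), \hat{u}^t - u^{t+1}\rangle + \langle F_v, \hat{v}^t - v^{t+1}\rangle\right] \leq V_{u^t}(u^{t+1}) - V_{\hat{u}^t}(u^{t+1}) - V_{u^t}(\hat{u}^t) + \epsilon_t; \tag{37}$$

and applying (36) with $\epsilon = \epsilon_t$, $[u; v] = x^t$, $\xi = \gamma_t F(y^t)$, $[u'; v'] = x^{t+1}$ and $[s; w] = z \in X$, we get

$$\gamma_t\left[\langle F_u(\hat{u}^t), u^{t+1} - s\rangle + \langle F_v, v^{t+1} - w\rangle\right] \leq V_{u^t}(s) - V_{u^{t+1}}(s) - V_{u^t}(u^{t+1}) + \epsilon_t. \tag{38}$$

Adding (38) to (37), we obtain for every $z = [s; w] \in X$

$$\gamma_t \langle F(y^t), y^t - z\rangle = \gamma_t\left[\langle F_u(\hat{u}^t), \hat{u}^t - s\rangle + \langle F_v, \hat{v}^t - w\rangle\right] \leq V_{u^t}(s) - V_{u^{t+1}}(s) + \sigma_t + 2\epsilon_t, \tag{39}$$

with

$$\sigma_t := \gamma_t \langle F_u(\hat{u}^t) - F_u(u^t), \hat{u}^t - u^{t+1}\rangle - V_{\hat{u}^t}(u^{t+1}) - V_{u^t}(\hat{u}^t).$$

Due to the strong convexity, with modulus 1, of $V_u(\cdot)$ w.r.t. $\|\cdot\|$, we have for all $u, \hat{u}$

$$V_u(\hat{u}) \geq \frac{1}{2}\|u - \hat{u}\|^2.$$

Therefore,

$$\begin{aligned}
\sigma_t &\leq \gamma_t \|F_u(\hat{u}^t) - F_u(u^t)\|_* \|\hat{u}^t - u^{t+1}\| - \frac{1}{2}\|\hat{u}^t - u^{t+1}\|^2 - \frac{1}{2}\|u^t - \hat{u}^t\|^2 \\
&\leq \frac{1}{2}\left[\gamma_t^2 \|F_u(\hat{u}^t) - F_u(u^t)\|_*^2 - \|u^t - \hat{u}^t\|^2\right] \\
&\leq \frac{1}{2}\left[\gamma_t^2 [M + L\|\hat{u}^t - u^t\|]^2 - \|u^t - \hat{u}^t\|^2\right],
\end{aligned}$$

where the last inequality follows from Assumption A.3. Note that $\gamma_t L < 1$ implies that

$$\gamma_t^2 [M + L\|\hat{u}^t - u^t\|]^2 - \|\hat{u}^t - u^t\|^2 \leq \max_r \left[\gamma_t^2 [M + Lr]^2 - r^2\right] = \frac{\gamma_t^2 M^2}{1 - \gamma_t^2 L^2}.$$

Let us assume that the step-sizes $\gamma_t > 0$ are chosen so that (33) holds, that is $\sigma_t \leq \gamma_t^2 M^2$. It is indeed
the case when $0 < \gamma_t \leq \frac{1}{\sqrt{2} L}$; when $M = 0$, we can also take $\gamma_t \leq \frac{1}{L}$. Summing up inequalities (39) over
$t = 1, 2, \ldots, T$ and taking into account that $V_{u^{T+1}}(s) \geq 0$, we finally conclude that for all $z = [s; w] \in X$,

$$\sum_{t=1}^T \lambda^t_T \langle F(y^t), y^t - z\rangle \leq \frac{V_{u^1}(s) + M^2 \sum_{t=1}^T \gamma_t^2 + 2\sum_{t=1}^T \epsilon_t}{\sum_{t=1}^T \gamma_t}, \quad \text{where } \lambda^t_T = \Big(\sum_{i=1}^T \gamma_i\Big)^{-1} \gamma_t.$$

Taking the supremum over $z \in X$ and invoking Proposition A.1 yields (34). □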
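The bound (34) can be checked numerically in the simplest setting: exact Euclidean prox-mappings ($\epsilon_t = 0$), $M = 0$, and constant step-size $\gamma_t = 1/L$. The sketch below runs the two-step Mirror Prox recursion on a bilinear saddle point problem over a box, where the prox-mapping is a clipped gradient step and the dual gap has a closed form. The problem instance, the box domain, and the absence of a $v$-component are simplifying assumptions made for illustration; this is not the composite algorithm of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 6, 4, 200
A = rng.standard_normal((n, m))
L = np.linalg.norm(A, 2)       # Lipschitz constant of F in the l2 norm

def F(z):
    # Monotone field of Phi(x, y) = x^T A y (min over x, max over y).
    x, y = z[:n], z[n:]
    return np.concatenate([A @ y, -A.T @ x])

def prox(z, xi):
    # Exact Euclidean prox-mapping on the box [-1, 1]^(n+m).
    return np.clip(z - xi, -1.0, 1.0)

gamma = 1.0 / L
z = np.ones(n + m)             # starting point x^1 (a corner of the box)
avg = np.zeros(n + m)
for _ in range(T):
    y_t = prox(z, gamma * F(z))      # extrapolation point y^t
    z = prox(z, gamma * F(y_t))      # next search point x^{t+1}
    avg += y_t / T                   # equal weights since gamma is constant

x_bar, y_bar = avg[:n], avg[n:]
# Closed-form dual gap of the bilinear problem on the box.
gap = float(np.abs(A.T @ x_bar).sum() + np.abs(A @ y_bar).sum())
theta = 0.5 * 4.0 * (n + m)    # sup over the box of 0.5 * ||u - x^1||^2
bound = theta / (T * gamma)    # right hand side of (34) with M = 0, eps_t = 0
```

With these choices the observed gap of the averaged point should sit below $\Theta[X] L / T$, which is what (34) reduces to.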


C Theoretical analysis of composite conditional gradient

C.1 Convergence rate

The CCG algorithm enjoys a convergence rate $O(t^{-(\kappa - 1)})$ in the evaluations of the function $\phi^+$, and the
accuracy certificates $(\Delta_t)$ enjoy the same rate $O(t^{-(\kappa - 1)})$ as well, for solving problems of type (15).

Proposition 3.1. Denote by $D$ the $\|\cdot\|$-diameter of $U$. When solving problems of type (15), the sequence of
iterates $(x^t)$ of CCG satisfies

$$\epsilon_t := \phi^+(x^t) - \min_{x \in X} \phi^+(x) \leq \frac{2 L_0 D^\kappa}{\kappa(3 - \kappa)} \left(\frac{2}{t + 1}\right)^{\kappa - 1}, \quad t \geq 2. \tag{40}$$

In addition, the accuracy certificates $(\Delta_t)$ satisfy

$$\min_{1 \leq s \leq t} \Delta_s \leq O(1) L_0 D^\kappa \left(\frac{2}{t + 1}\right)^{\kappa - 1}, \quad t \geq 2. \tag{41}$$
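C.2 Proof of Proposition 3.1

Throughout the proof, $\delta_s := \phi^+(x_2^s) - \min_{x_2 \in X_2} \phi^+(x_2)$ denotes the optimality gap at step $s$ and
$\Delta_s := \langle \nabla\phi^+(x_2^s), x_2^s - x_2[\nabla\phi^+(x_2^s)]\rangle$ the associated accuracy certificate.

1°. The projection of $X_2$ onto the $u$-component is contained in $U_2$, whence

$$\|u_2[\nabla\phi(u_2)] - u_2\| \leq D.$$

This observation, due to the structure of $\phi^+$, implies that whenever $x_2, x_2' \in X_2$ and $\gamma \in [0, 1]$, we have

$$\phi^+(x_2 + \gamma(x_2' - x_2)) \leq \phi^+(x_2) + \gamma \langle \nabla\phi^+(x_2), x_2' - x_2\rangle + \frac{L_0 D^\kappa}{\kappa} \gamma^\kappa.$$

Setting $x_2^{s,+} = x_2^s + \gamma_s (x_2[\nabla\phi^+(x_2^s)] - x_2^s)$ and $\gamma_s = 2/(s + 1)$, we have

$$\begin{aligned}
\delta_{s+1} &\leq \phi^+(x_2^{s,+}) - \min_{x_2 \in X_2} \phi^+(x_2) \\
&\leq \delta_s + \gamma_s \langle \nabla\phi^+(x_2^s), x_2[\nabla\phi^+(x_2^s)] - x_2^s\rangle + \frac{L_0 D^\kappa}{\kappa} \gamma_s^\kappa \\
&= \delta_s - \gamma_s \Delta_s + \frac{L_0 D^\kappa}{\kappa} \gamma_s^\kappa,
\end{aligned} \qquad (42)\text{-}(45)$$

whence, due to $\Delta_s \geq \delta_s \geq 0$,

$$\begin{array}{ll}
(i) & \delta_{s+1} \leq (1 - \gamma_s)\delta_s + \frac{L_0 D^\kappa}{\kappa} \gamma_s^\kappa, \quad s = 1, 2, \ldots \\
(ii) & \gamma_\tau \Delta_\tau \leq \delta_\tau - \delta_{\tau+1} + \frac{L_0 D^\kappa}{\kappa} \gamma_\tau^\kappa, \quad \tau = 1, 2, \ldots
\end{array} \tag{46}$$

2°. Let us prove (40) by induction on $s \geq 2$. By (46.i) and due to $\gamma_1 = 1$ we have $\delta_2 \leq \frac{L_0 D^\kappa}{\kappa}$, whence
$\delta_2 \leq \frac{2 L_0 D^\kappa}{\kappa(3 - \kappa)} \gamma_2^{\kappa - 1}$ due to $\gamma_2 = 2/3$ and $1 < \kappa \leq 2$. Now assume that $\delta_s \leq \frac{2 L_0 D^\kappa}{\kappa(3 - \kappa)} \gamma_s^{\kappa - 1}$ for some $s \geq 2$. Then,
invoking (46.i),

$$\begin{aligned}
\delta_{s+1} &\leq \frac{2 L_0 D^\kappa}{\kappa(3 - \kappa)} \gamma_s^{\kappa - 1}(1 - \gamma_s) + \frac{L_0 D^\kappa}{\kappa} \gamma_s^\kappa \\
&= \frac{2 L_0 D^\kappa}{\kappa(3 - \kappa)} \left[\gamma_s^{\kappa - 1} - \frac{\kappa - 1}{2} \gamma_s^\kappa\right] \\
&= \frac{2 L_0 D^\kappa}{\kappa(3 - \kappa)} 2^{\kappa - 1} \left[(s + 1)^{1 - \kappa} + (1 - \kappa)(s + 1)^{-\kappa}\right].
\end{aligned}$$

Therefore, by convexity of $(s + 1)^{1 - \kappa}$ in $s$,

$$\delta_{s+1} \leq \frac{2 L_0 D^\kappa}{\kappa(3 - \kappa)} 2^{\kappa - 1} (s + 2)^{1 - \kappa} = \frac{2 L_0 D^\kappa}{\kappa(3 - \kappa)} \gamma_{s+1}^{\kappa - 1}.$$

The induction is completed.

3°. To prove (41), given $s \geq 2$, let $s_- = \mathrm{Ceil}(\max[2, s/2])$. Summing up inequalities (46.ii) over $s_- \leq \tau \leq s$,
we get

$$\Big(\min_{\tau \leq s} \Delta_\tau\Big) \sum_{\tau = s_-}^{s} \gamma_\tau \leq \sum_{\tau = s_-}^{s} \gamma_\tau \Delta_\tau \leq \delta_{s_-} - \delta_{s+1} + \frac{L_0 D^\kappa}{\kappa} \sum_{\tau = s_-}^{s} \gamma_\tau^\kappa \leq O(1) L_0 D^\kappa \gamma_s^{\kappa - 1},$$

and $\sum_{\tau = s_-}^{s} \gamma_\tau \geq O(1)$, and (41) follows. □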
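The recursion analysed above can be observed on a toy instance. The sketch below runs a plain (non-composite) conditional gradient method with step-size $2/(s+1)$ on a quadratic over the probability simplex, a special case with $\kappa = 2$ and $L_0 = 1$, and records both the optimality gap and the certificate. The objective, the simplex domain, and the variable names are illustrative assumptions; the composite structure of CCG is deliberately omitted.

```python
import numpy as np

n, T = 20, 200
rng = np.random.default_rng(0)
p = rng.random(n)
p /= p.sum()                           # known minimizer, inside the simplex

def phi(x):
    # Smooth objective with minimum value 0 and gradient Lipschitz constant 1.
    return 0.5 * float(np.sum((x - p) ** 2))

x = np.zeros(n)
x[0] = 1.0                             # first iterate: a vertex of the simplex
L0, D2 = 1.0, 2.0                      # D2 is the squared l2 diameter of the simplex
gaps, certs = [phi(x)], []
for s in range(1, T):
    grad = x - p
    i = int(np.argmin(grad))           # LMO over the simplex returns a vertex
    vertex = np.zeros(n)
    vertex[i] = 1.0
    certs.append(float(grad @ (x - vertex)))   # accuracy certificate at step s
    x = x + (2.0 / (s + 1)) * (vertex - x)     # step-size 2 / (s + 1)
    gaps.append(phi(x))                # optimality gap at step s + 1

# Right hand side of (40) with kappa = 2, for t = 2, ..., T.
bounds = [2.0 * L0 * D2 / (t + 1) for t in range(2, T + 1)]
```

Here `gaps[t - 1]` is the gap at iterate $t$, so it can be compared entrywise with `bounds`; each certificate also upper-bounds the gap at the same iterate, which is the inequality between certificate and gap used in step 1° of the proof.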
D Semi-Proximal Mirror-Prox

D.1 Theoretical analysis for Semi-Proximal Mirror-Prox

We first restate Proposition 3.2 and provide the proof below.

Proposition 3.2. Under the assumptions (A.1)-(A.4) and (S.1)-(S.3) with $M = 0$, for the outlined
algorithm to return an $\epsilon$-solution to the variational inequality VI$(X, F)$, the total number of Mirror Prox
steps required does not exceed $O(1)\, L\Theta[X]/\epsilon$, and the total number of calls to the Linear Minimization Oracle
does not exceed

$$\mathcal{N} = O(1) \left(\frac{L_0 L^\kappa D^\kappa}{\epsilon^\kappa}\right)^{1/(\kappa - 1)} \Theta[X].$$

In particular, if we use the Euclidean proximal setup on $U_2$ with $\omega_2(\cdot) = \frac{1}{2}\|x_2\|^2$, which leads to $\kappa = 2$ and
$L_0 = 1$, then the number of LMO calls does not exceed $\mathcal{N} = O(1)\, L^2 D^2 (\Theta[X_1] + D^2)/\epsilon^2$.

Proof. Let us fix $N$ as the number of Mirror Prox steps. Since $M = 0$, the efficiency estimate of Theorem 3.1,
applied with the step-sizes $\gamma_t = L^{-1}$, implies that

$$\epsilon_{\mathrm{VI}}(\bar{x}_N|X, F) \leq \frac{L\left(\Theta[X] + 2\sum_{t=1}^N \epsilon_t\right)}{N}.$$

Let us fix $\epsilon_t = \frac{\Theta[X]}{2N}$ for each $t = 1, \ldots, N$; then from Proposition 3.1, it takes at most
$s = O(1) \left(\frac{L_0 D^\kappa N}{\Theta[X]}\right)^{1/(\kappa - 1)}$ LMO calls to generate a point such that $\Delta_s \leq \epsilon_t$. Moreover, we have

$$\epsilon_{\mathrm{VI}}(\bar{x}_N|X, F) \leq 2\,\frac{L\Theta[X]}{N}.$$

Therefore, to ensure $\epsilon_{\mathrm{VI}}(\bar{x}_N|X, F) \leq \epsilon$ for a given accuracy $\epsilon > 0$, the number of Mirror Prox steps $N$ is at
most $O(L\Theta[X]/\epsilon)$ and the number of LMO calls on $X_2$ needed is at most

$$\mathcal{N} = O(1) \left(\frac{L_0 D^\kappa N}{\Theta[X]}\right)^{1/(\kappa - 1)} N = O(1) \left(\frac{L_0 L^\kappa D^\kappa}{\epsilon^\kappa}\right)^{1/(\kappa - 1)} \Theta[X].$$

In particular, if $\kappa = 2$ and $L_0 = 1$, this quantity reduces to

$$\mathcal{N} = O(1)\, \frac{L^2 D^2\, \Theta[X]}{\epsilon^2}.$$

□


D.2 Discussion of Semi-Proximal Mirror-Prox 

The proposed Semi-Proximal Mirror-Prox algorithm enjoys the optimal complexity bound, i.e. $O(1/\epsilon^2)$,
in the number of calls to the linear minimization oracle. Furthermore, Semi-Proximal Mirror-Prox generalizes
previously proposed approaches and improves upon them in special cases of problem (1).

When there is no regularisation penalty, Semi-Proximal Mirror-Prox is more general than previous
algorithms for solving the corresponding constrained non-smooth optimisation problem. Semi-Proximal
Mirror-Prox does not require assumptions on favorable geometry of the dual domain $Z$ or simplicity of $\psi(\cdot)$ in (2).
When the regularisation is simply a norm (with no operator in front of the argument), Semi-Proximal
Mirror-Prox is competitive with previously proposed approaches [16, 24] based on smoothing techniques.

When the regularisation penalty is non-trivial, Semi-Proximal Mirror-Prox is, to our knowledge, the first
proximal-free or conditional-gradient-type optimization algorithm.
E Numerical experiments and implementation details

E.1 Matrix completion: $\ell_2$-fit + nuclear norm

We first consider the following type of matrix completion problem,

$$\min_x \|P_\Omega x - b\|_2 + \lambda \|x\|_{\mathrm{nuc}} \tag{47}$$

where $\|\cdot\|_{\mathrm{nuc}}$ stands for the nuclear norm and $P_\Omega x$ is the restriction of $x$ onto the cells in $\Omega$.

Competing algorithms. We compare the following three candidate algorithms: i) Semi-Proximal Mirror-Prox
(Semi-MP); ii) conditional gradient after smoothing (Smooth-CG); iii) inexact accelerated proximal
gradient after smoothing (Semi-SPG). We provide below the key steps of each algorithm.

1. Semi-MP: this is short for our Semi-Proximal Mirror-Prox algorithm. We solve the saddle point
reformulation given by

$$\min_{x, v:\, \|x\|_{\mathrm{nuc}} \leq v}\ \max_{\|y\|_2 \leq 1}\ \langle P_\Omega x - b, y\rangle + \lambda v \tag{48}$$

which is equivalent to the semi-structured variational inequality Semi-VI$(X, F)$ with $X = \{[u = (x; y); v] : \|x\|_{\mathrm{nuc}} \leq v, \|y\|_2 \leq 1\}$
and $F = [F_u(u); F_v] = [P_\Omega^T y;\ b - P_\Omega x;\ \lambda]$. The subdomain $X_1 = \{y : \|y\|_2 \leq 1\}$ is given by full-prox setup
and the subdomain $X_2 = \{(x; v) : \|x\|_{\mathrm{nuc}} \leq v\}$ is given by LMO. By setting both the distance generating
functions $\omega_x(x)$ and $\omega_y(y)$ as the Euclidean distance, the update of $y$ reduces to a gradient step, and the
update of $x$ follows the composite conditional gradient routine over a simple quadratic problem.


2. Smooth-CG: The algorithm ([21]) directly applies the generalized composite conditional gradient to
the following smoothed problem, obtained with the Nesterov smoothing technique:

$$\min_{x, v:\, \|x\|_{\mathrm{nuc}} \leq v} f^\gamma(x) + \lambda v, \quad \text{where } f^\gamma(x) = \max_{\|y\|_2 \leq 1} \left\{\langle P_\Omega x - b, y\rangle - \frac{\gamma}{2}\|y\|_2^2\right\}. \tag{49}$$

Under the full memory version, the update of $x$ at step $t$ requires solving the re-optimization problem

$$\min_{\theta_1, \ldots, \theta_t \geq 0}\ f^\gamma\Big(\sum_{i=1}^t \theta_i u_i v_i^T\Big) + \lambda \sum_{i=1}^t \theta_i \tag{50}$$

where $\{u_i, v_i\}_{i=1}^t$ are the singular vectors collected from the linear minimization oracles. As
suggested in [21], we use the quasi-Newton solver L-BFGS-B to solve the above re-optimization
subproblem. Notice that in this situation, solving (50) can be relatively efficient even for large $t$ since
computing the gradient of the objective in (50) does not necessarily require forming the full matrix
representation of $x = \sum_{i=1}^t \theta_i u_i v_i^T$.

3. Semi-SPG: The approach is to apply the accelerated proximal gradient to the smoothed composite
model as in (49) and approximately solve the proximal mappings via conditional gradient routines.
In fact, Semi-SPG can be considered as a direct extension of the conditional gradient sliding to the
composite setting. As in Semi-MP, the update of $x$ is given by the composite conditional gradient
routine over a simple quadratic problem, followed by an additional interpolation step. Since the Lipschitz
constant is not known, the learning rate is selected through backtracking.
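All three methods above access the nuclear-norm domain only through a linear minimization oracle. As a rough illustration, the sketch below implements an LMO over the unit nuclear-norm ball using a full SVD; the function name `lmo_nuclear_ball` and the fixed-radius ball (rather than the conic domain $\{(x; v) : \|x\|_{\mathrm{nuc}} \leq v\}$ used in the text) are assumptions made for simplicity, and a practical implementation would more likely compute only the leading singular pair with an iterative solver.

```python
import numpy as np

def lmo_nuclear_ball(G, radius=1.0):
    """Return a minimizer of <G, S> over {S : ||S||_nuc <= radius}."""
    U, sigma, Vt = np.linalg.svd(G, full_matrices=False)
    # The minimizer is minus the leading singular pair, scaled to the radius.
    return -radius * np.outer(U[:, 0], Vt[0, :])

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 5))        # stands in for a gradient matrix
S = lmo_nuclear_ball(G)
value = float(np.sum(G * S))           # attained linear objective <G, S>
top_sigma = float(np.linalg.svd(G, compute_uv=False)[0])
```

The attained value equals minus the largest singular value of `G`, and the output is a rank-one matrix, which is why memory-based variants can store iterates as lists of singular vector pairs.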


For Semi-MP and Semi-SPG, we test two different strategies for the inexact prox-mappings: a) a fixed number of inner
CG steps and b) decaying $\epsilon_t = c/t$ as the theory suggests. For the sake of simplicity, we generate the
synthetic data such that the magnitudes of the constant factors (i.e. Frobenius norm and nuclear norm of the
optimal solution) are approximately of order 1, which means the convergence rate is dominated mainly by the
number of LMO calls. In Fig. 3 we evaluate the optimality gap of these algorithms with different parameters
(e.g. number of inner steps, scaling factor $c$, smoothness parameter $\gamma$) and compare their performance given
the best-tuned parameter. As the plot shows, the Semi-MP algorithm generates a solution with e = 10“^ 
accuracy within about 3000 LMO calls, which compares favorably with the worst-case complexity of
$O(1/\epsilon^2)$. Also, the plots indicate that using the second strategy with $O(1/t)$ decaying inexactness provides
better and more reliable performance than using a fixed number of inner steps. Similar trends are observed
for Semi-SPG. One can see that these two algorithms based on inexact proximal mappings are notably
faster than applying conditional gradient to the smoothed problem. Moreover, since Smooth-CG requires
additional computation and memory for the re-optimization procedure, the actual difference in terms
of CPU time could be more significant.





Figure 3: Matrix completion on synthetic data (1024 x 1024): optimality gap vs. the number of LMO calls.
From left to right: (a) Semi-MP; (b) Semi-SPG; (c) Smooth-CG; (d) best of three algorithms.


E.2 Robust collaborative filtering: $\ell_1$-empirical risk + nuclear norm

We consider the collaborative filtering problem, with a nuclear-norm regularisation penalty and an
$\ell_1$-empirical risk function:

$$\min_x \frac{1}{|E|} \sum_{(i,j) \in E} |x_{ij} - b_{ij}| + \lambda \|x\|_{\mathrm{nuc}} \tag{51}$$

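Competing algorithms. We compare the above three candidate algorithms. The smoothed problem for
Semi-SPG and Smooth-CG in this case becomes

$$\min_{x, v:\, \|x\|_{\mathrm{nuc}} \leq v} f^\gamma(x) + \lambda v, \quad \text{where } f^\gamma(x) = \max_{\|y\|_\infty \leq 1} \left\{\frac{1}{|E|} \sum_{(i,j) \in E} (x_{ij} - b_{ij})\, y_{ij} - \frac{\gamma}{2}\|y\|_2^2\right\}. \tag{52}$$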
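For intuition on what smoothing an $\ell_1$ risk does, the sketch below evaluates a Nesterov-type smoothing of the $\ell_1$ norm of a residual vector, assuming a quadratic prox-function on the dual variable as in (49). The maximization over the $\ell_\infty$ ball separates across coordinates and yields a Huber-like function whose gradient is the clipped residual. The function name `smoothed_l1` and the vector `r` (standing for the scaled residuals on the observed entries) are hypothetical.

```python
import numpy as np

def smoothed_l1(r, gamma):
    """Value and gradient of max over ||y||_inf <= 1 of <r, y> - (gamma/2) ||y||_2^2."""
    y = np.clip(r / gamma, -1.0, 1.0)   # coordinate-wise maximizer, also the gradient in r
    value = float(r @ y - 0.5 * gamma * (y @ y))
    return value, y

rng = np.random.default_rng(0)
r = rng.standard_normal(50)             # hypothetical residual vector
gamma = 0.1
value, grad = smoothed_l1(r, gamma)
l1 = float(np.abs(r).sum())
```

The smoothed value underestimates the $\ell_1$ norm by at most $\gamma/2$ per coordinate, which is the usual accuracy versus smoothness trade-off governed by the smoothing parameter.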


Note that in this case, for Smooth-CG, solving the re-optimization problem in (50) at each iteration
requires computing the full matrix representation for the gradient. For large $t$ and large-scale problems, the
computation cost for re-optimization is no longer negligible. However, Semi-MP and Semi-SPG do not
suffer from this limitation since the conditional gradient routines are called for simple quadratic subproblems.
For this particular example, we implement Semi-MP slightly differently from the above scheme. We solve
the following saddle point reformulation with properly selected $\rho$.


$$\min_{x, y, v_1, v_2:\ v_1 \geq \|x\|_{\mathrm{nuc}},\ v_2 \geq \|y\|_1}\ \max_{\|w\|_2 \leq 1}\ v_2 + \lambda v_1 + \rho \langle Ax - b - y, w\rangle \tag{53}$$


where $A$ denotes the linear operator of the empirical risk term. The semi-structured variational inequality
Semi-VI$(X, F)$ associated with the above saddle point problem is given by
$X = \{[u = (x, y, w); v = (v_1, v_2)] : \|x\|_{\mathrm{nuc}} \leq v_1, \|y\|_1 \leq v_2, \|w\|_2 \leq 1\}$
and $F = [F_u(u); F_v] = [\rho A^T w;\ -\rho w;\ \rho(y - Ax + b);\ \lambda;\ 1]$. The subdomain $X_1 =
\{(y, w, v_2) : \|y\|_1 \leq v_2, \|w\|_2 \leq 1\}$ is given by full-prox setup and the subdomain $X_2 = \{(x; v_1) : \|x\|_{\mathrm{nuc}} \leq v_1\}$
is given by LMO. By setting both the distance generating functions as the Euclidean distance, the update
of $w$ reduces to a gradient step, the update of $y$ reduces to the soft-thresholding operator, and the update
of $x$ is given by the composite conditional gradient routine. In our experiment, the factor $\rho$ is updated
adaptively in such a way that the back-projection step does not increase the objective function value. We set
the step-sizes $\gamma_t$ along the iterations using line-search. All in all, the Semi-Proximal Mirror-Prox algorithm
(Semi-MP) is fully automatic, and does not require tuning of any parameter.
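The soft-thresholding operator mentioned for the update of $y$ is the proximal operator of a scaled $\ell_1$ norm. A minimal sketch follows; the threshold `tau` is a generic parameter here, since how it relates to the step-size and to the variable $v_2$ in the actual implementation is not specified in the text.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1: shrink each entry of z toward zero by tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

z = np.array([-2.0, -0.3, 0.0, 0.5, 1.5])
shrunk = soft_threshold(z, 0.5)
```

Entries with magnitude at most `tau` are set to zero and the others move toward zero by `tau`, which is what makes the update cheap and sparsity-inducing.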

We run the above three algorithms on the small and medium MovieLens datasets. The small-size
dataset consists of 943 users and 1682 movies with about 100K ratings, while the medium-size dataset consists
of 6040 users and 3952 movies with about 1M ratings. We follow [21] to set the regularisation parameters. We
randomly pick 80% of the entries to build the training dataset, and compute the normalized mean absolute
error (NMAE) on the remaining test dataset. For Smooth-CG, we run the algorithm with different
smoothing parameters in {1e-3, 1e-2, 1e-1, 1e0} and select the one with the best performance.
For the Semi-SPG algorithm, we adopt the best smoothing parameter found for Smooth-CG. We use two
different strategies to control the number of LMO calls at each iteration, i.e. the accuracy of the proximal
mapping for both Semi-SPG and Semi-MP: a) a fixed number of inner CG steps and b) decaying $\epsilon_t = c/t$ as the
theory suggests. We report in Fig. 4 and Fig. 5 the performance of each algorithm under different choices
of parameters, and the overall comparison of objective value and NMAE on test data in Fig. 6.






Figure 4: Robust collaborative filtering on MovieLens 100K: objective function vs elapsed time.
From left to right: (a) Semi-MP; (b) Semi-SPG; (c) Smooth-CG; (d) best of three algorithms.

Figure 5: Robust collaborative filtering on MovieLens 1M: objective function vs elapsed time.
From left to right: (a) Semi-MP; (b) Semi-SPG; (c) Smooth-CG; (d) best of three algorithms.

Figure 6: Robust collaborative filtering on MovieLens: objective function and test NMAE against elapsed
time. From left to right: (a) MovieLens 100K objective; (b) MovieLens 100K test NMAE; (c) MovieLens
1M objective; (d) MovieLens 1M test NMAE.

In Fig. 4 and Fig. 5, we can see that using a fixed number of inner CG steps sometimes achieves performance
comparable to using the decaying $\epsilon_t$. In Fig. 6, we can see that Semi-MP clearly outperforms Smooth-CG, while
it is competitive with Semi-SPG. In the large-scale setting, Semi-MP achieves a better objective value as well as
a better test NMAE compared to Smooth-CG.


E.3 Link prediction: hinge loss + $\ell_1$-norm + nuclear norm

We consider the following model for the link prediction problem,

$$\min_x \frac{1}{|E|} \sum_{(i,j) \in E} \max\big(1 - (b_{ij} - 0.5)\, x_{ij},\ 0\big) + \lambda_1 \|x\|_1 + \lambda_2 \|x\|_{\mathrm{nuc}} \tag{54}$$


This example is more complicated than the previous two since it has not only a nonsmooth loss
function but also two regularization terms. Applying Smooth-CG or Semi-SPG would require building
two smooth approximations, one for the hinge loss term and one for the $\ell_1$-norm term. Therefore, we consider
an alternative approach, Semi-LPADMM, where we apply the linearized preconditioned ADMM algorithm and
solve the proximal mappings through conditional gradient routines. To our knowledge, ADMM with early
stopping is not well analyzed in the literature, but intuitively, as long as the accumulated error is controlled
sufficiently, the variant will converge.
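To fix ideas about the three terms in this model, the sketch below evaluates a link prediction objective of this form on a small dense example. It assumes labels $b_{ij} \in \{0, 1\}$, a boolean mask marking the observed pairs, and an average of the hinge loss over those pairs; the function name and argument names are hypothetical.

```python
import numpy as np

def link_prediction_objective(x, b, mask, lam1, lam2):
    """Average hinge loss on observed pairs plus l1 and nuclear-norm penalties."""
    margins = 1.0 - (b - 0.5) * x
    hinge = np.maximum(margins, 0.0)[mask].mean()
    l1 = np.abs(x).sum()
    nuc = np.linalg.norm(x, "nuc")
    return float(hinge + lam1 * l1 + lam2 * nuc)

b = np.array([[1.0, 0.0], [0.0, 1.0]])     # binary adjacency labels
mask = np.ones_like(b, dtype=bool)         # all pairs observed in this toy case
x = np.zeros((2, 2))
value_at_zero = link_prediction_objective(x, b, mask, 0.1, 0.1)
```

At $x = 0$ every margin equals one and both penalties vanish, so the objective equals one; all three terms are non-smooth, which is why plain smoothing would need a separate approximation for each.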

We conduct experiments on a binary social graph data set called Wikivote, which consists of 7118 nodes
and 103,747 edges. Since the computation cost of these two algorithms mainly comes from the LMO calls,
we report below the performance in terms of the number of LMO calls. For the first set of experiments, we
select the 1024 highest-degree users from Wikivote and run the two algorithms on this small dataset with
different strategies for the inner LMO calls.

In Fig. 7, we observe that Semi-MP is less sensitive to the inner accuracies of the prox-mappings than
the ADMM variant, which sometimes stops progressing if the prox-mappings of early iterations are not solved
with sufficient accuracy. Another observation is that in this example, the second strategy, which essentially
saves LMO calls, works better in the long run than using a fixed number of LMO calls. The results
on the full dataset again indicate that our algorithm performs better than the semi-proximal variant
of the ADMM algorithm.
Figure 7: Link prediction on Wikivote: objective function value against the number of LMO calls. From left to right:
(a) Wikivote(1024) with fixed inner steps; (b) Wikivote(1024) with $\epsilon_t = c/t$; (c) Wikivote(full).